Setting the Environment

The task specifies that the solution be developed in Shiny using shiny.semantic 4.0.0. We will use data.table for data I/O and data wrangling since it excels in speed and memory footprint (see the automated benchmarks from H2O). Also, all data used by the Shiny server will be saved in fst format, one of the fastest R binary formats.

library(data.table, warn.conflicts = FALSE) # data I/O and wrangling
library(magrittr)                           # pipe operator
library(googledrive)                        # non-interactive download from Google Drive
library(ggplot2)                            # plotting
library(fst)                                # fast binary serialization
library(geodist)                            # geodesic distance calculations

Exploratory data analysis

In this section we download the data with non-interactive authorization, retaining controlled access to the Google Drive repo and ensuring reproducibility throughout the process. Then the data are loaded for Exploratory Data Analysis.

Calculations are performed with assignment by reference using the := operator of the data.table extension of data.frame. We will use this technique extensively throughout the computations because of its optimized performance: it makes minimal copies in memory. Data wrangling is also carried out inside the data.table environment to take advantage of its optimized backend. We carefully use vectorised operations and avoid for loops; this is an essential technique in high-level languages since it reduces the number of function calls, which are costly.
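To make the difference concrete, here is a minimal sketch of assignment by reference (the toy table and column names are illustrative, not from the dataset):

dt <- data.table(speed = c(99L, 100L, 102L))
dt[, speed_kn := speed / 10]      # := updates the column in place; no copy of dt is made
# dt$speed_kn <- dt$speed / 10    # the base-R idiom triggers copy-on-modify instead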

Download Data

We will set up a process that requires no user interaction. This ensures reproducibility, as the document renders with no intervention, and is key to maintainability, continuous integration, development and testing. We will also get familiar with integrating Google Cloud services into data pipelines using a service account.

To ensure no human involvement, the most appropriate token to use is the service account token. Broadly speaking, there are two steps involved in the process:

Step 1: Get a service account and then download a token in json format.

Step 2: Call the auth function proactively and provide the path to your service account token.
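In the googledrive package the auth function is drive_auth(). A minimal sketch of step 2, assuming the token was downloaded as "service-token.json" (the file name is an assumption):

drive_auth(path = "service-token.json")  # non-interactive auth via the service account token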

Concerning the first step, a Google account is needed. The service account can be defined through a Google Cloud Platform project (see the documentation). The process is as follows:

  1. Navigate from the Developers Console to your service account.

  2. Follow the steps to define the service account.

Then do Create key and download it as JSON; this is the service account token.

Through the service account, the computing session has limited visibility into Google Drive: only files shared with the service account can be accessed. Since I’m not an owner of the shared dataset, I created a copy on my drive and shared it with the email address of the service account.
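Sharing the copy can itself be scripted with googledrive, run as the human owner of the file; a sketch, where the service account email below is a placeholder:

drive_share(
  as_dribble("ships_04112020.zip"),   # the copy living on my drive
  role = "reader",                    # read-only access is enough for downloading
  type = "user",
  emailAddress = "my-bot@my-project.iam.gserviceaccount.com"  # placeholder address
)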

Notice: In a CI/CD setting, where the process is committed and pushed, the .json token should be encrypted and never committed in plain text.
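One way to encrypt it is with the openssl package, keeping the password in an environment variable of the CI runner; a hedged sketch, where the file names and the TOKEN_PASSWORD variable are assumptions:

token_raw <- readBin("service-token.json", "raw", file.size("service-token.json"))
key <- openssl::sha256(charToRaw(Sys.getenv("TOKEN_PASSWORD")))   # 32-byte AES key
saveRDS(openssl::aes_cbc_encrypt(token_raw, key = key), "service-token.json.enc")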

The following screenshot shows the definition of the service account. The developer should additionally specify that it is for Google Drive API usage.

“Service Account screenshot”

dir.create("dataset/", showWarnings = FALSE)  # local folder for the raw data
options(gargle_oauth_email = TRUE)            # let gargle use the cached credential without prompting
drive_download("ships_04112020.zip", path = "dataset/ships.zip", overwrite = TRUE)
## File downloaded:
##   * ships_04112020.zip
## Saved locally as:
##   * dataset/ships.zip

Load Data

  • Unzip data and report file size
ships_dt <- fread(cmd = "unzip -p dataset/ships.zip", stringsAsFactors = TRUE)  # stream the csv out of the zip without unzipping to disk
str(ships_dt)
## Classes 'data.table' and 'data.frame':   3102887 obs. of  20 variables:
##  $ LAT        : num  54.8 54.8 54.8 54.8 54.7 ...
##  $ LON        : num  19 19 19 19 19 ...
##  $ SPEED      : int  99 100 102 102 102 102 101 101 102 104 ...
##  $ COURSE     : int  200 200 196 198 196 198 197 194 199 199 ...
##  $ HEADING    : int  196 196 196 196 195 196 196 196 196 197 ...
##  $ DESTINATION: Factor w/ 646 levels "0046705808","3.E MAN",..: 151 151 151 151 151 151 151 151 151 151 ...
##  $ FLAG       : Factor w/ 45 levels "--","AG","BB",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ LENGTH     : int  100 100 100 100 100 100 100 100 100 100 ...
##  $ SHIPNAME   : Factor w/ 1187 levels ". PRINCE OF WAVES",..: 503 503 503 503 503 503 503 503 503 503 ...
##  $ SHIPTYPE   : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ SHIP_ID    :integer64 2764 2764 2764 2764 2764 2764 2764 2764 ... 
##  $ WIDTH      : int  14 14 14 14 14 14 14 14 14 14 ...
##  $ DWT        : int  5727 5727 5727 5727 5727 5727 5727 5727 5727 5727 ...
##  $ DATETIME   : POSIXct, format: "2016-12-19 11:29:01" "2016-12-19 11:31:02" ...
##  $ PORT       : Factor w/ 6 levels "gdansk","gdynia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date       : IDate, format: "2016-12-19" "2016-12-19" ...
##  $ week_nb    : int  51 51 51 51 51 51 51 51 51 51 ...
##  $ ship_type  : Factor w/ 9 levels "Cargo","Fishing",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ port       : Factor w/ 6 levels "Gdańsk","Gdynia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_parked  : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We observe that the dataset is relatively “healthy”: there are some missing values, but the column names are descriptive. There are ~3 million observations.
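To quantify the missing values, a per-column NA count is a one-liner; an illustrative check (not part of the original chunk output):

ships_dt[, lapply(.SD, function(x) sum(is.na(x)))]  # NA count per column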

Tidy data

We remove rows with NA values for LON or LAT.

nrow_init <- ships_dt %>% nrow
ships_dt <- ships_dt[!is.na(LAT)][!is.na(LON)]
  • NA Rows removed: 0

We can bring the data into a more “tidy” format by extracting a vessels table. In that sense each row is an observation, each column is a variable, and each type of observational unit is a table (a ports table could also be extracted, but we leave that for future work).

In the vessels table we will store the columns that are invariant with respect to geolocation and time.

vessel_vars <- c("SHIPNAME", "LENGTH", "SHIPTYPE", "WIDTH", "FLAG", "ship_type")
vessels_dt <- ships_dt[, c("SHIP_ID", vessel_vars), with = FALSE] %>%
  unique
  • Discarding vars from AIS table
ships_dt <- ships_dt[, -vessel_vars, with = FALSE]
  • SHIP_ID should be unique in vessels table. However, this is not the case.

Inspecting cases,

dup_ids <- vessels_dt[, table(SHIP_ID) > 1] %>%
  which %>%
  names %>% 
  as.integer

vessels_dt[SHIP_ID %in% dup_ids][order(SHIP_ID)] %>%
  head
##    SHIP_ID           SHIPNAME LENGTH SHIPTYPE WIDTH FLAG ship_type
## 1:  158315 R 355 TRINE LOUISE     20        2     6   DK   Fishing
## 2:  158315 R 355 TRINE LOUISE     19        2     6   DK   Fishing
## 3:  315731               ODYS     35        3     9   PL       Tug
## 4:  315731               BBAS     35        3     9   PL       Tug
## 5:  315950           .WLA-311     24        2     6   PL   Fishing
## 6:  315950            WLA-311     24        2     6   PL   Fishing

There is some noise in the data: the duplicated records for each id look very similar.

  • Thus, we keep the first observation for each vessel id.
vessels_dt <- vessels_dt[, head(.SD, 1), by = "SHIP_ID"]
testthat::expect_false(any(vessels_dt[, table(SHIP_ID) > 1]))
  • Also, the name [SAT-AIS] looks like a data discrepancy, so we remove it.
vessels_dt <- vessels_dt[SHIPNAME != "[SAT-AIS]"]
vessels_dt %>%
  head
##    SHIP_ID  SHIPNAME LENGTH SHIPTYPE WIDTH FLAG ship_type
## 1:    2764    KAROLI    100        7    14   MT     Cargo
## 2:    3338     KERLI     89        7    13   MT     Cargo
## 3:    3615 FINNKRAFT    162        7    20   FI     Cargo
## 4:    5628  FINNPULP    187        7    26   FI     Cargo
## 5:    6223      MERI    105        7    18   FI     Cargo
## 6:    6333   BALTICO    169        8    24   SE    Tanker

Longest distance calculation

  • Pre-compute for each vessel the longest distance traveled between two consecutive observations.

  • Sort with respect to SHIP_ID and DATETIME.

  • Take the lag vectors for LAT and LON.

  • Calculate Karney distance between LON, LAT and their respective lag vectors (accuracy ~ 1mm).

  • Get max value(s) for each SHIP_ID.

  • Get the latest max value to cover the edge case of a vessel travelling exactly the same distance more than once.

ships_dt <- ships_dt[order(SHIP_ID, DATETIME)]
ships_dt[, `:=`(fromLAT = shift(LAT), fromLON = shift(LON), fromTIME = shift(DATETIME)), by = "SHIP_ID"]

For the geodesic distance calculation we will use the lightweight and efficient geodist library with the Karney distance (Karney, “Algorithms for geodesics”, J Geod 87:43-55), since it remains accurate even for large distances.

ships_dt[, DIST := geodist(cbind(LON, LAT), cbind(fromLON, fromLAT), paired = TRUE, measure = "geodesic")]

ships_dt[, DIST := round(DIST, 3)]  # Karney distance is accurate to less than 1 mm

ships_long_dist <- ships_dt[
  !is.na(DIST)][
    # DATETIME last so that ties in DIST are resolved by the latest observation
    order(SHIP_ID, DIST, DATETIME)][
      , tail(.SD, 1), by = "SHIP_ID"]
  • Merging with the vessels data, keeping all vessel ids even if a max distance is not retrieved.
vessels_dt <- merge(vessels_dt, ships_long_dist, by = "SHIP_ID", all.x = TRUE)
  • Calculating time difference between two consecutive observations,
vessels_dt[, delta_tau := DATETIME - fromTIME]
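Note that delta_tau is a difftime whose display units are chosen automatically; if fixed units are preferred, they can be made explicit, as in this sketch (the column name delta_tau_secs is an assumption):

vessels_dt[, delta_tau_secs := as.numeric(difftime(DATETIME, fromTIME, units = "secs"))]  # always seconds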

Last Location

If no vessel is selected in the app, we display all vessels at their latest known location.

last_signal <- 
  ships_dt[order(-DATETIME, SHIP_ID), head(.SD, 1), by = "SHIP_ID", .SDcols = c("LAT", "LON")]
setnames(last_signal, c("LON", "LAT"), c("lastLON", "lastLAT"))

vessels_dt <- merge(vessels_dt, last_signal, by = "SHIP_ID", all.x = TRUE)

Tidying vessels data

Cleaning step to prepare data for shiny.

Missing location

For some vessels we get only one observation, so no distance between consecutive observations can be computed for them.

Although these cases are handled by the Shiny server function, we remove them proactively.

vessels_dt <- vessels_dt[!is.na(LAT) & !is.na(LON)]  # keep only vessels with a valid location

Preparing timeseries data

Data for plotly visualization.

  • Calculate the mean distance between consecutive observations by vessel type and date.
ts_data <- merge(
  ships_dt[!is.na(DIST), .(date, SHIP_ID, DIST)], 
  vessels_dt[, .(SHIP_ID, SHIPTYPE, ship_type)],
  by = "SHIP_ID"
)
ts_data <- ts_data[, .(date_dist = mean(DIST)), by = c("date", "SHIPTYPE", "ship_type")]

Vessels Data

Exposing the data makes them easier to inspect and helps with development and debugging of the app later on.

DT::datatable(
  vessels_dt,
  extensions = 'Buttons',
  options = list(
    scrollX = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Timeseries Data

DT::datatable(
  ts_data,
  extensions = 'Buttons',
  options = list(
    scrollX = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Save data

Vessels data

fwrite(vessels_dt, "vessels.csv")
  • Dimensions: 1191, 27
  • File size: 230.2 Kb
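Since the introduction states that the data consumed by the Shiny server are stored in fst format, the equivalent fst export is a one-liner; a sketch, with the file name vessels.fst an assumption:

write_fst(vessels_dt, "vessels.fst", compress = 50)  # fst's default compression level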

Timeseries data

fwrite(ts_data, "ts_data.csv")
  • Dimensions: 72, 4
  • File size: 2.8 Kb

Saving the current Unix time for reactive polling

Sys.time() %>%
  as.numeric %>%
  as.character %>%
  writeLines("timestamp.txt")
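On the Shiny side this timestamp can drive reactivePoll(), so that the heavier data file is re-read only when the timestamp changes; a minimal sketch to be placed inside the server function (the 1000 ms interval and object names are assumptions):

vessels_data <- shiny::reactivePoll(
  1000, session,
  checkFunc = function() readLines("timestamp.txt"),  # cheap check: the saved timestamp
  valueFunc = function() fread("vessels.csv")         # expensive reload, only on change
)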

Linked data

Total time

Report’s render time: 22 sec

Session Info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] geodist_0.0.7     fst_0.9.4         ggplot2_3.3.3     googledrive_1.0.1
## [5] magrittr_2.0.1    data.table_1.13.6
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6        pillar_1.4.7      bslib_0.2.4       compiler_4.0.4   
##  [5] jquerylib_0.1.4   tools_4.0.4       testthat_3.0.2    bit_4.0.4        
##  [9] digest_0.6.27     gtable_0.3.0      jsonlite_1.7.2    evaluate_0.14    
## [13] lifecycle_1.0.0   tibble_3.0.6      gargle_1.1.0      pkgconfig_2.0.3  
## [17] rlang_0.4.11      cli_2.5.0         DBI_1.1.1         rstudioapi_0.13  
## [21] crosstalk_1.1.1   parallel_4.0.4    curl_4.3          yaml_2.2.1       
## [25] xfun_0.21         withr_2.4.2       dplyr_1.0.4       stringr_1.4.0    
## [29] httr_1.4.2        knitr_1.31        htmlwidgets_1.5.3 generics_0.1.0   
## [33] vctrs_0.3.6       sass_0.3.1        fs_1.5.0          askpass_1.1      
## [37] rappdirs_0.3.3    DT_0.17           bit64_4.0.5       grid_4.0.4       
## [41] tidyselect_1.1.0  glue_1.4.2        R6_2.5.0          rmarkdown_2.7    
## [45] waldo_0.2.4       purrr_0.3.4       scales_1.1.1      ellipsis_0.3.2   
## [49] htmltools_0.5.1.1 assertthat_0.2.1  colorspace_2.0-0  stringi_1.5.3    
## [53] munsell_0.5.0     openssl_1.4.3     crayon_1.4.1